BankChurners Project / Yeoman Yoon

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings("ignore")

# Boosting
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier

# Tuning
from time import time
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline

Import Data

In [ ]:
bank = pd.read_csv("BankChurners.csv")
data = bank.copy() # work on a copy so the original data stays untouched
In [ ]:
data.head()
In [ ]:
print(f"The data has {data.shape[0]} rows and {data.shape[1]} columns")
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
CLIENTNUM                   10127 non-null int64
Attrition_Flag              10127 non-null object
Customer_Age                10127 non-null int64
Gender                      10127 non-null object
Dependent_count             10127 non-null int64
Education_Level             10127 non-null object
Marital_Status              10127 non-null object
Income_Category             10127 non-null object
Card_Category               10127 non-null object
Months_on_book              10127 non-null int64
Total_Relationship_Count    10127 non-null int64
Months_Inactive_12_mon      10127 non-null int64
Contacts_Count_12_mon       10127 non-null int64
Credit_Limit                10127 non-null float64
Total_Revolving_Bal         10127 non-null int64
Avg_Open_To_Buy             10127 non-null float64
Total_Amt_Chng_Q4_Q1        10127 non-null float64
Total_Trans_Amt             10127 non-null int64
Total_Trans_Ct              10127 non-null int64
Total_Ct_Chng_Q4_Q1         10127 non-null float64
Avg_Utilization_Ratio       10127 non-null float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

The data has no missing values.

Data Dictionary

Data Dictionary:

  1. CLIENTNUM: Client number. Unique identifier for the customer holding the account
  2. Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer" (Target value)
  3. Customer_Age: Age in Years
  4. Gender: Gender of the account holder
  5. Dependent_count: Number of dependents
  6. Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
  7. Marital_Status: Marital Status of the account holder
  8. Income_Category: Annual Income Category of the account holder
  9. Card_Category: Type of Card
  10. Months_on_book: Period of relationship with the bank
  11. Total_Relationship_Count: Total no. of products held by the customer
  12. Months_Inactive_12_mon: No. of months inactive in the last 12 months
  13. Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  14. Credit_Limit: Credit Limit on the Credit Card
  15. Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  16. Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  17. Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  18. Total_Trans_Ct: Total Transaction Count (Last 12 months)
  19. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
  20. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
  21. Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
In [6]:
data.drop(["CLIENTNUM"], axis=1, inplace=True) # drop the unique customer ID; it carries no predictive information
In [7]:
data.describe().T
Out[7]:
count mean std min 25% 50% 75% max
Customer_Age 10127.0 46.325960 8.016814 26.0 41.000 46.000 52.000 73.000
Dependent_count 10127.0 2.346203 1.298908 0.0 1.000 2.000 3.000 5.000
Months_on_book 10127.0 35.928409 7.986416 13.0 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.0 3.812580 1.554408 1.0 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.0 2.341167 1.010622 0.0 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.0 2.455317 1.106225 0.0 2.000 2.000 3.000 6.000
Credit_Limit 10127.0 8631.953698 9088.776650 1438.3 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.0 1162.814061 814.987335 0.0 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.0 7469.139637 9090.685324 3.0 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.0 0.759941 0.219207 0.0 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.0 4404.086304 3397.129254 510.0 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.0 64.858695 23.472570 10.0 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.0 0.712222 0.238086 0.0 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.0 0.274894 0.275691 0.0 0.023 0.176 0.503 0.999
In [8]:
data.nunique().sort_values(ascending=False)
Out[8]:
Avg_Open_To_Buy             6813
Credit_Limit                6205
Total_Trans_Amt             5033
Total_Revolving_Bal         1974
Total_Amt_Chng_Q4_Q1        1158
Avg_Utilization_Ratio        964
Total_Ct_Chng_Q4_Q1          830
Total_Trans_Ct               126
Customer_Age                  45
Months_on_book                44
Education_Level                7
Contacts_Count_12_mon          7
Months_Inactive_12_mon         7
Dependent_count                6
Total_Relationship_Count       6
Income_Category                6
Marital_Status                 4
Card_Category                  4
Gender                         2
Attrition_Flag                 2
dtype: int64

Datatype Conversion

To reduce memory usage, convert object columns to category.

In [9]:
to_category = ['Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Gender', 
               'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Dependent_count']

for col in to_category:
    data[col]=data[col].astype('category')
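The memory saving from `category` can be sanity-checked on a toy low-cardinality column (illustrative sketch only; the Series below is made up, not the project data):

```python
import pandas as pd

# Toy low-cardinality column: category stores small integer codes
# plus a tiny lookup table, instead of one Python string per row
s = pd.Series(["Blue"] * 10000 + ["Gold"] * 100)
obj_bytes = s.memory_usage(deep=True)
cat_bytes = s.astype("category").memory_usage(deep=True)
print(cat_bytes < obj_bytes)  # True
```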

Since the target variable (Attrition_Flag) has only 2 categories, it is reasonable to encode them as 0 and 1.

In [10]:
data['Attrition_Flag'] = data['Attrition_Flag'].replace({'Attrited Customer': 0, 'Existing Customer': 1})
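A `.map()` with a single dict does the same encoding in one pass (hedged sketch; the toy Series below mirrors the project's labels but is made up):

```python
import pandas as pd

# One-step binary encoding with a mapping dict (toy column)
flag = pd.Series(["Attrited Customer", "Existing Customer", "Existing Customer"])
encoded = flag.map({"Attrited Customer": 0, "Existing Customer": 1})
print(encoded.tolist())  # [0, 1, 1]
```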
In [11]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
Attrition_Flag              10127 non-null int64
Customer_Age                10127 non-null int64
Gender                      10127 non-null category
Dependent_count             10127 non-null category
Education_Level             10127 non-null category
Marital_Status              10127 non-null category
Income_Category             10127 non-null category
Card_Category               10127 non-null category
Months_on_book              10127 non-null int64
Total_Relationship_Count    10127 non-null category
Months_Inactive_12_mon      10127 non-null category
Contacts_Count_12_mon       10127 non-null category
Credit_Limit                10127 non-null float64
Total_Revolving_Bal         10127 non-null int64
Avg_Open_To_Buy             10127 non-null float64
Total_Amt_Chng_Q4_Q1        10127 non-null float64
Total_Trans_Amt             10127 non-null int64
Total_Trans_Ct              10127 non-null int64
Total_Ct_Chng_Q4_Q1         10127 non-null float64
Avg_Utilization_Ratio       10127 non-null float64
dtypes: category(9), float64(5), int64(6)
memory usage: 961.6 KB
In [12]:
all_col = ['Attrition_Flag','Customer_Age','Gender','Dependent_count','Education_Level','Marital_Status',
           'Income_Category','Card_Category','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon',
           'Contacts_Count_12_mon','Credit_Limit','Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1',
           'Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']

for col in all_col:
    print(f"{col} = {data[col].unique()}")
    print()
Attrition_Flag = [1 0]

Customer_Age = [45 49 51 40 44 32 37 48 42 65 56 35 57 41 61 47 62 54 59 63 53 58 55 66
 50 38 46 52 39 43 64 68 67 60 73 70 36 34 33 26 31 29 30 28 27]

Gender = [M, F]
Categories (2, object): [M, F]

Dependent_count = [3, 5, 4, 2, 0, 1]
Categories (6, int64): [3, 5, 4, 2, 0, 1]

Education_Level = [High School, Graduate, Uneducated, Unknown, College, Post-Graduate, Doctorate]
Categories (7, object): [High School, Graduate, Uneducated, Unknown, College, Post-Graduate, Doctorate]

Marital_Status = [Married, Single, Unknown, Divorced]
Categories (4, object): [Married, Single, Unknown, Divorced]

Income_Category = [$60K - $80K, Less than $40K, $80K - $120K, $40K - $60K, $120K +, Unknown]
Categories (6, object): [$60K - $80K, Less than $40K, $80K - $120K, $40K - $60K, $120K +, Unknown]

Card_Category = [Blue, Gold, Silver, Platinum]
Categories (4, object): [Blue, Gold, Silver, Platinum]

Months_on_book = [39 44 36 34 21 46 27 31 54 30 48 37 56 42 49 33 28 38 41 43 45 52 40 50
 35 47 32 20 29 25 53 24 55 23 22 26 13 51 19 15 17 18 16 14]

Total_Relationship_Count = [5, 6, 4, 3, 2, 1]
Categories (6, int64): [5, 6, 4, 3, 2, 1]

Months_Inactive_12_mon = [1, 4, 2, 3, 6, 0, 5]
Categories (7, int64): [1, 4, 2, 3, 6, 0, 5]

Contacts_Count_12_mon = [3, 2, 0, 1, 4, 5, 6]
Categories (7, int64): [3, 2, 0, 1, 4, 5, 6]

Credit_Limit = [12691.  8256.  3418. ...  5409.  5281. 10388.]

Total_Revolving_Bal = [ 777  864    0 ...  534  476 2241]

Avg_Open_To_Buy = [11914.  7392.  3418. ... 11831.  5409.  8427.]

Total_Amt_Chng_Q4_Q1 = [1.335 1.541 2.594 ... 0.222 0.204 0.166]

Total_Trans_Amt = [ 1144  1291  1887 ... 10291  8395 10294]

Total_Trans_Ct = [ 42  33  20  28  24  31  36  32  26  17  29  27  21  30  16  18  23  22
  40  38  25  43  37  19  35  15  41  57  12  14  34  44  13  47  10  39
  53  50  52  48  49  45  11  55  46  54  60  51  63  58  59  61  78  64
  65  62  67  66  56  69  71  75  74  76  84  82  88  68  70  73  86  72
  79  80  85  81  87  83  91  89  77 103  93  96  99  92  90  94  95  98
 100 102  97 101 104 105 106 107 109 118 108 122 113 112 111 127 114 124
 110 120 125 121 117 126 134 116 119 129 131 115 128 139 123 130 138 132]

Total_Ct_Chng_Q4_Q1 = [1.625 3.714 2.333 ... 1.684 0.918]

Avg_Utilization_Ratio = [0.061 0.105 0.    ... 0.014 0.009]

In [13]:
small_num_col = ['Customer_Age','Credit_Limit','Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1',
                 'Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']
big_num_col = ['Dependent_count','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon',
               'Contacts_Count_12_mon']
cat_col = ['Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Gender', 'Attrition_Flag']
In [14]:
for i in small_num_col:
    sns.histplot(data[i], kde=True) # distplot is deprecated in newer seaborn
    plt.show()
In [15]:
for i in big_num_col:
    sns.countplot(x=data[i]) # newer seaborn requires keyword arguments
    plt.show()
In [16]:
for i in cat_col:
    sns.countplot(x=data[i])
    plt.show()

Missing Value Treatment

Although there are no NaN values, "Unknown" entries could be considered missing. However, since they were still recorded during data collection, "Unknown" may act as a category in its own right (it may reflect customer traits, such as an unwillingness to disclose information). Hence, no missing-value treatment is applied.
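Before deciding to keep "Unknown" as a category, it helps to tally how common it is per column. A hedged sketch (the toy frame below reuses the project's column names but its values are made up):

```python
import pandas as pd

# Tally "Unknown" entries per column to gauge their prevalence (toy frame)
df = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "Unknown"],
    "Marital_Status": ["Married", "Single", "Unknown"],
})
unknown_counts = (df == "Unknown").sum()
print(unknown_counts.to_dict())  # {'Education_Level': 2, 'Marital_Status': 1}
```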

Error and Outlier Treatment

After an initial EDA, a couple of errors/outliers were detected.

  1. Credit_Limit is capped at its maximum of $34,516; many customers sit exactly at this cap.
  2. Months_on_book has an unusually large number of observations at 36.
  3. Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 have outliers (>3) (DROP)
In [17]:
data[data['Credit_Limit']>33000]
Out[17]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
6 1 51 M 4 Unknown Married $120K + Gold 46 6 1 3 34516.0 2264 32252.0 1.975 1330 31 0.722 0.066
45 1 49 M 4 Uneducated Single $80K - $120K Blue 30 3 2 3 34516.0 0 34516.0 1.621 1444 28 1.333 0.000
61 0 48 M 2 Graduate Married $60K - $80K Silver 35 2 4 4 34516.0 0 34516.0 0.763 691 15 0.500 0.000
65 1 51 M 4 Uneducated Single $80K - $120K Silver 38 4 1 4 34516.0 1515 33001.0 0.592 1293 32 0.600 0.044
70 1 51 M 4 Graduate Single $120K + Blue 42 3 2 3 34516.0 1763 32753.0 1.266 1550 41 1.050 0.051
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10088 1 45 M 2 Graduate Single $60K - $80K Silver 33 4 2 2 34516.0 1529 32987.0 0.609 13940 105 0.810 0.044
10095 1 46 M 3 Unknown Married $80K - $120K Blue 33 4 1 3 34516.0 1099 33417.0 0.816 15490 110 0.618 0.032
10098 0 55 M 3 Graduate Single $120K + Silver 36 4 3 4 34516.0 0 34516.0 1.007 9931 70 0.750 0.000
10110 1 56 M 1 Graduate Single $80K - $120K Silver 49 5 2 2 34516.0 1091 33425.0 0.640 15274 108 0.714 0.032
10112 0 33 M 2 College Married $120K + Gold 20 2 1 4 34516.0 0 34516.0 1.004 9338 73 0.622 0.000

550 rows × 20 columns

In [18]:
data[data['Months_on_book']==36]
Out[18]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
2 1 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
5 1 44 M 2 Graduate Married $40K - $60K Blue 36 3 1 2 4010.0 1247 2763.0 1.376 1088 24 0.846 0.311
8 1 37 M 3 Uneducated Single $60K - $80K Blue 36 5 2 0 22352.0 2517 19835.0 3.355 1350 24 1.182 0.113
9 1 48 M 2 Graduate Single $80K - $120K Blue 36 6 3 3 11656.0 1677 9979.0 1.524 1441 32 0.882 0.144
12 1 56 M 1 College Single $80K - $120K Blue 36 3 6 0 11751.0 0 11751.0 3.397 1539 17 3.250 0.000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10115 1 38 M 1 Uneducated Single $40K - $60K Blue 36 2 3 2 5639.0 1558 4081.0 0.614 16628 109 0.946 0.276
10116 1 46 M 5 College Single $80K - $120K Blue 36 1 2 3 13187.0 2241 10946.0 0.689 15354 112 0.931 0.170
10118 0 50 M 1 Unknown Unknown $80K - $120K Blue 36 6 3 4 9959.0 952 9007.0 0.825 10310 63 1.100 0.096
10124 0 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.0 0 5409.0 0.819 10291 60 0.818 0.000
10125 0 30 M 2 Graduate Unknown $40K - $60K Blue 36 4 3 3 5281.0 0 5281.0 0.535 8395 62 0.722 0.000

2463 rows × 20 columns

In [19]:
data[data['Total_Amt_Chng_Q4_Q1']>3]
Out[19]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
8 1 37 M 3 Uneducated Single $60K - $80K Blue 36 5 2 0 22352.0 2517 19835.0 3.355 1350 24 1.182 0.113
12 1 56 M 1 College Single $80K - $120K Blue 36 3 6 0 11751.0 0 11751.0 3.397 1539 17 3.250 0.000
In [20]:
data[data['Total_Ct_Chng_Q4_Q1']>3]
Out[20]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
1 1 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
12 1 56 M 1 College Single $80K - $120K Blue 36 3 6 0 11751.0 0 11751.0 3.397 1539 17 3.250 0.000
269 1 54 M 5 Graduate Married $60K - $80K Blue 38 3 3 3 2290.0 1434 856.0 0.923 1119 18 3.500 0.626
773 1 61 M 0 Post-Graduate Married Unknown Blue 53 6 2 3 14434.0 1927 12507.0 2.675 1731 32 3.571 0.134
In [21]:
data = data.drop([1, 8, 12, 269, 773]) # drop the outlier rows identified above (by index label)
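A more robust alternative to hard-coded index labels is a boolean mask built from the same >3 rule, so the drop still works if the index changes (sketch on a made-up toy frame):

```python
import pandas as pd

# Drop rows where either Q4/Q1 change ratio exceeds 3 (toy values)
df = pd.DataFrame({"Total_Amt_Chng_Q4_Q1": [0.7, 3.4, 0.9],
                   "Total_Ct_Chng_Q4_Q1": [0.8, 1.0, 3.7]})
mask = (df["Total_Amt_Chng_Q4_Q1"] > 3) | (df["Total_Ct_Chng_Q4_Q1"] > 3)
df = df[~mask].reset_index(drop=True)
print(len(df))  # 1
```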

Univariate Conclusion:

  1. The youngest customer is 26.
  2. Months_on_book has an unusually large number of observations at 36; either a big promotion happened three years ago, or some measurement stopped recording after three years.
  3. There are far more existing customers than attrited ones; keep this class imbalance in mind.
  4. Card_Category is seriously imbalanced.
In [22]:
data.corr()
Out[22]:
Attrition_Flag Customer_Age Months_on_book Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
Attrition_Flag 1.000000 -0.018347 -0.013839 0.023802 0.263093 0.000214 0.131937 0.168830 0.372064 0.296229 0.178513
Customer_Age -0.018347 1.000000 0.788969 0.002562 0.014962 0.001220 -0.065122 -0.046211 -0.066640 -0.017372 0.007162
Months_on_book -0.013839 0.788969 1.000000 0.007396 0.008458 0.006636 -0.052189 -0.038325 -0.049384 -0.018705 -0.007406
Credit_Limit 0.023802 0.002562 0.007396 1.000000 0.042275 0.995982 0.010349 0.171933 0.076270 -0.002649 -0.482891
Total_Revolving_Bal 0.263093 0.014962 0.008458 0.042275 1.000000 -0.047372 0.058327 0.064502 0.056263 0.092613 0.624255
Avg_Open_To_Buy 0.000214 0.001220 0.006636 0.995982 -0.047372 1.000000 0.005119 0.166112 0.071209 -0.010950 -0.538737
Total_Amt_Chng_Q4_Q1 0.131937 -0.065122 -0.052189 0.010349 0.058327 0.005119 1.000000 0.043627 0.011994 0.370628 0.038394
Total_Trans_Amt 0.168830 -0.046211 -0.038325 0.171933 0.064502 0.166112 0.043627 1.000000 0.807198 0.092439 -0.083174
Total_Trans_Ct 0.372064 -0.066640 -0.049384 0.076270 0.056263 0.071209 0.011994 0.807198 1.000000 0.124027 0.002637
Total_Ct_Chng_Q4_Q1 0.296229 -0.017372 -0.018705 -0.002649 0.092613 -0.010950 0.370628 0.092439 0.124027 1.000000 0.077384
Avg_Utilization_Ratio 0.178513 0.007162 -0.007406 -0.482891 0.624255 -0.538737 0.038394 -0.083174 0.002637 0.077384 1.000000
In [23]:
plt.subplots(figsize=(10,10)) 
sns.heatmap(data.corr(), annot =True, linewidth=1)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1441ebe5208>
In [24]:
sns.pairplot(data)
Out[24]:
<seaborn.axisgrid.PairGrid at 0x1441d1e7d88>
In [25]:
for i in small_num_col:
    plt.figure(figsize=(10,5))
    sns.violinplot(x=data['Attrition_Flag'], y=data[i])
    plt.show()
In [26]:
for i in cat_col:
    sns.countplot(x=data['Attrition_Flag'], hue=data[i])
    plt.show()

Bivariate Conclusion:

  1. A revolving balance below 500 is associated with more credit card cancellations.
  2. Fewer than 60 credit card transactions is associated with more cancellations.
  3. A total transaction amount below 4k is associated with more cancellations.
  4. Gender, income, marital status, and education level do not seem to affect cancellation much.
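Observations like the first one can be backed by a crosstab of the thresholded feature against the target. A hedged sketch (the toy frame below is made up for illustration, not the project data):

```python
import pandas as pd

# Crosstab: low revolving balance (<500) vs. attrition flag (toy values)
df = pd.DataFrame({
    "Attrition_Flag": [0, 0, 1, 1, 0, 1],
    "Total_Revolving_Bal": [120, 300, 900, 1500, 450, 2000],
})
low_bal = df["Total_Revolving_Bal"] < 500
ct = pd.crosstab(low_bal, df["Attrition_Flag"])
print(ct)
```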

Data Split

In [27]:
X = data.drop(['Attrition_Flag'],axis=1)
X = pd.get_dummies(X,drop_first=True)
y = data['Attrition_Flag']
In [28]:
x_train, x_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(x_train.shape, x_test.shape)
(7085, 50) (3037, 50)
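The `stratify=y` argument keeps the class mix nearly identical in train and test, which matters with an imbalanced target. A self-contained sketch (toy labels roughly matching this data's ~84/16 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stratified split preserves class proportions on both sides (toy labels)
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 84 + [0] * 16)
_, _, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(y_tr.mean(), y_te.mean())  # both close to 0.84
```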
In [29]:
print(f"Class distribution: {y.value_counts(1)}")
print("")
print(f"Test class distribution: {y_test.value_counts(1)}") # stratified split keeps proportions
Class distribution: 1    0.839261
0    0.160739
Name: Attrition_Flag, dtype: float64

Function Implementation

In [30]:
def get_score_acc_rec_prec(model):
    '''
    model : classifier to score
    Returns [0:train_acc, 1:test_acc, 2:train_recall, 3:test_recall, 4:train_precision, 5:test_precision]
    '''
    
    scores = [] # empty list to store all scores (avoids shadowing the built-in `list`)
    
    pred_train = model.predict(x_train)
    pred_test = model.predict(x_test)

    # Compute scores
    train_acc = model.score(x_train, y_train)
    test_acc = model.score(x_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    
    # Collect accuracy, recall, and precision in order
    scores.append(train_acc)
    scores.append(test_acc)
    scores.append(train_recall)
    scores.append(test_recall)
    scores.append(train_precision)
    scores.append(test_precision)
    
    return scores # list with train and test scores
In [31]:
def print_score(scores, what="all"):
    '''
    scores : list returned by get_score_acc_rec_prec
    what : "acc", "rec", "prec", or "all"
    '''
    if what == "acc":
        print("Accuracy on training set : ", scores[0])
        print("Accuracy on test set : ", scores[1])
    elif what == "rec":
        print("Recall on training set : ", scores[2])
        print("Recall on test set : ", scores[3])
    elif what == "prec":
        print("Precision on training set : ", scores[4])
        print("Precision on test set : ", scores[5])
    else:
        print("Accuracy on training set : ", scores[0])
        print("Accuracy on test set : ", scores[1])
        print("Recall on training set : ", scores[2])
        print("Recall on test set : ", scores[3])
        print("Precision on training set : ", scores[4])
        print("Precision on test set : ", scores[5])
    
In [32]:
def make_cm(model, y_actual, labels=[1, 0]):
    '''
    model    : fitted classifier to evaluate
    y_actual : ground truth (e.g. y_test)
    labels   : class order for the matrix rows and columns
    '''
    
    y_predict = model.predict(x_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["Actual 1", "Actual 0"],
                         columns=["Predict 1", "Predict 0"])
    
    sns.heatmap(df_cm, annot=True, fmt='d')
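To annotate each confusion-matrix cell with both the raw count and its percentage of all predictions, a string array can be passed to seaborn's `annot` with `fmt=''`. A minimal sketch with hypothetical counts (not computed from the models below):

```python
import numpy as np

# Hypothetical 2x2 confusion-matrix counts
cm = np.array([[1116, 73],
               [91, 185]])
total = cm.sum()

# Build per-cell labels like "1116\n76.18%"
annot = np.array([[f"{v}\n{v / total:.2%}" for v in row] for row in cm])
print(annot[0, 0])

# fmt='' would tell seaborn to render the precomputed strings as-is:
# sns.heatmap(cm, annot=annot, fmt='')
```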

Model Building

Logistic Regression

In [33]:
lg = LogisticRegression(random_state=1)
lg.fit(x_train,y_train)
Out[33]:
LogisticRegression(random_state=1)
In [34]:
make_cm(lg,y_test)
In [35]:
lg_score = get_score_acc_rec_prec(lg)
print_score(lg_score,'all')
Accuracy on training set :  0.8841213832039521
Accuracy on test set :  0.8755350675008232
Recall on training set :  0.9744365960309451
Recall on test set :  0.9701843860337387
Precision on training set :  0.8964876992108928
Precision on test set :  0.8911711711711712

Bagging:

Decision Tree

In [36]:
dTree = DecisionTreeClassifier(criterion='gini',random_state=1)
dTree.fit(x_train, y_train)
Out[36]:
DecisionTreeClassifier(random_state=1)
In [37]:
make_cm(dTree,y_test)
In [38]:
dTree_score = get_score_acc_rec_prec(dTree)
print_score(dTree_score,'all')
Accuracy on training set :  1.0
Accuracy on test set :  0.928547909120843
Recall on training set :  1.0
Recall on test set :  0.955276579050608
Precision on training set :  1.0
Precision on test set :  0.9594168636721828

Random Forest

In [39]:
rForest = RandomForestClassifier(random_state=1)
rForest.fit(x_train,y_train)
Out[39]:
RandomForestClassifier(random_state=1)
In [40]:
make_cm(rForest,y_test)
In [41]:
rForest_score = get_score_acc_rec_prec(rForest)
print_score(rForest_score,'all')
Accuracy on training set :  1.0
Accuracy on test set :  0.947645702996378
Recall on training set :  1.0
Recall on test set :  0.9862691251471165
Precision on training set :  1.0
Precision on test set :  0.9529946929492039
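Beyond accuracy, recall, and precision, a fitted random forest also exposes `feature_importances_`, which can help cross-check the EDA conclusions about which attributes matter. A small self-contained sketch on synthetic data (not the bank data), where feature 0 alone determines the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on feature 0
rng = np.random.RandomState(1)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

rf = RandomForestClassifier(random_state=1).fit(X_demo, y_demo)
imp = rf.feature_importances_
print(imp.argmax())  # feature 0 should dominate the importances
```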

Bagging Classifier

In [42]:
bClassifier = BaggingClassifier(random_state=1)
bClassifier.fit(x_train,y_train)
Out[42]:
BaggingClassifier(random_state=1)
In [43]:
make_cm(bClassifier,y_test)
In [44]:
bClassifier_score = get_score_acc_rec_prec(bClassifier)
print_score(bClassifier_score,'all')
Accuracy on training set :  0.998165137614679
Accuracy on test set :  0.9529140599275601
Recall on training set :  0.9986545576858392
Recall on test set :  0.9717536288740682
Precision on training set :  0.9991586740703349
Precision on test set :  0.9721350078492935

Boosting:

AdaBoost Classifier

In [45]:
adaBoost = AdaBoostClassifier(random_state=1)
adaBoost.fit(x_train,y_train)
Out[45]:
AdaBoostClassifier(random_state=1)
In [46]:
make_cm(adaBoost,y_test)
In [47]:
adaBoost_score = get_score_acc_rec_prec(adaBoost)
print_score(adaBoost_score,'all')
Accuracy on training set :  0.9630204657727593
Accuracy on test set :  0.9578531445505433
Recall on training set :  0.9833501513622603
Recall on test set :  0.979599843075716
Precision on training set :  0.9728785357737105
Precision on test set :  0.9704624951418578

Gradient Boosting Classifier

In [48]:
gBoost = GradientBoostingClassifier(random_state=1)
gBoost.fit(x_train,y_train)
Out[48]:
GradientBoostingClassifier(random_state=1)
In [49]:
make_cm(gBoost,y_test)
In [50]:
gBoost_score = get_score_acc_rec_prec(gBoost)
print_score(gBoost_score,'all')
Accuracy on training set :  0.9751587861679605
Accuracy on test set :  0.9618044122489299
Recall on training set :  0.9922637066935756
Recall on test set :  0.9862691251471165
Precision on training set :  0.978441127694859
Precision on test set :  0.96878612716763

XGBoost Classifier

In [51]:
xgBoost = XGBClassifier(random_state=1, eval_metric='logloss')
xgBoost.fit(x_train,y_train)
Out[51]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)
In [52]:
make_cm(xgBoost,y_test)
In [53]:
xgBoost_score = get_score_acc_rec_prec(xgBoost)
print_score(xgBoost_score,'all')
Accuracy on training set :  1.0
Accuracy on test set :  0.9680605861047086
Recall on training set :  1.0
Recall on test set :  0.9843075715967046
Precision on training set :  1.0
Precision on test set :  0.9777864380358535

Compare Models

In [54]:
#list of all the models
modelList = [lg,dTree,rForest,bClassifier,adaBoost,gBoost,xgBoost]

acc_train = []
acc_test = []
rec_train = []
rec_test = []
prec_train = []
prec_test = []

for model in modelList:
    score = get_score_acc_rec_prec(model)
    acc_train.append(score[0])
    acc_test.append(score[1])
    rec_train.append(score[2])
    rec_test.append(score[3])
    prec_train.append(score[4])
    prec_test.append(score[5])
    
In [55]:
comparison_frame = pd.DataFrame({'Model':['Logistic Regression','Decision Tree','Random Forest','Bagging Classifier',
                                          'AdaBoost Classifier','Gradient Boosting Classifier','XGBoost Classifier'], 
                                          'TrainAccuracy': acc_train,'TestAccuracy': acc_test,
                                          'TrainRecall': rec_train,'TestRecall': rec_test,
                                          'TrainPrecision': prec_train,'TestPrecision': prec_test}) 


comparison_frame
Out[55]:
Model TrainAccuracy TestAccuracy TrainRecall TestRecall TrainPrecision TestPrecision
0 Logistic Regression 0.884121 0.875535 0.974437 0.970184 0.896488 0.891171
1 Decision Tree 1.000000 0.928548 1.000000 0.955277 1.000000 0.959417
2 Random Forest 1.000000 0.947646 1.000000 0.986269 1.000000 0.952995
3 Bagging Classifier 0.998165 0.952914 0.998655 0.971754 0.999159 0.972135
4 AdaBoost Classifier 0.963020 0.957853 0.983350 0.979600 0.972879 0.970462
5 Gradient Boosting Classifier 0.975159 0.961804 0.992264 0.986269 0.978441 0.968786
6 XGBoost Classifier 1.000000 0.968061 1.000000 0.984308 1.000000 0.977786

Choose Random Forest, AdaBoost, and XGBoost for tuning.
(Precision is the more important metric in this case.)
Reasons:
Random Forest: see whether hyperparameter tuning can drastically improve it.
XGBoost and AdaBoost: the highest performance without clear overfitting.
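The ranking can also be read programmatically. A minimal sketch, rebuilding just the test-precision column from the comparison table output above (values copied, not recomputed):

```python
import pandas as pd

# Test precision per model, copied from the comparison table above
frame = pd.DataFrame({
    "Model": ["Random Forest", "AdaBoost Classifier", "XGBoost Classifier"],
    "TestPrecision": [0.952995, 0.970462, 0.977786],
})

# Rank by precision, the metric emphasized for this problem
ranked = frame.sort_values("TestPrecision", ascending=False)
print(ranked.iloc[0]["Model"])  # -> XGBoost Classifier
```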

Random Forest Hyperparameter Tuning

GridSearch

In [56]:
param_grid_rForest = {"max_depth": [2, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 8],
              "min_samples_leaf": [1, 3, 7],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
In [57]:
%%time
grid_search_rForest = GridSearchCV(rForest, param_grid=param_grid_rForest)
grid_search_rForest.fit(X, y) # note: fits on the full data; fitting on x_train alone would keep the test set unseen
Wall time: 9min 14s
Out[57]:
GridSearchCV(estimator=RandomForestClassifier(random_state=1),
             param_grid={'bootstrap': [True, False],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [2, None], 'max_features': [1, 3, 10],
                         'min_samples_leaf': [1, 3, 7],
                         'min_samples_split': [2, 3, 8]})
In [65]:
print(f"Best Parameters:{grid_search_rForest.best_params_} \nScore: {grid_search_rForest.best_score_}")
Best Parameters:{'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 8} 
Score: 0.9246205045625334
In [69]:
#grid_search_rForest.cv_results_['mean_test_score']
#grid_search_rForest.best_estimator_

RandomSearch

In [70]:
param_dist_rForest = {"max_depth": [2, None],
              "max_features": sp_randint(1, 12),
              "min_samples_split": sp_randint(2, 12),
              "min_samples_leaf": sp_randint(1, 12),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
In [71]:
%%time
random_search_rForest = RandomizedSearchCV(rForest, param_distributions=param_dist_rForest, n_iter=10) # default cv=5 in recent scikit-learn
random_search_rForest.fit(X, y) # note: fits on the full data; fitting on x_train alone would keep the test set unseen
Wall time: 22.5 s
Out[71]:
RandomizedSearchCV(estimator=RandomForestClassifier(random_state=1),
                   param_distributions={'bootstrap': [True, False],
                                        'criterion': ['gini', 'entropy'],
                                        'max_depth': [2, None],
                                        'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001442F87C208>,
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001442F8861C8>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001442F87C708>})
In [72]:
print(f"Best Parameters:{random_search_rForest.best_params_} \nScore: {random_search_rForest.best_score_}")
Best Parameters:{'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 3, 'min_samples_split': 6} 
Score: 0.9143455326208949

AdaBoost Hyperparameter Tuning

GridSearch

In [73]:
%%time 
pipeline = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))

param_grid_adaBoost = {
    "adaboostclassifier__n_estimators": np.arange(10, 80, 10),
    "adaboostclassifier__learning_rate": [0.01,0.05,0.1,0.3,1],
    "adaboostclassifier__base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Use Accuracy since both recall and precision are equally important in reality.
scorer_adaBoost = metrics.make_scorer(metrics.accuracy_score)

grid_search_adaBoost = GridSearchCV(estimator=pipeline, param_grid=param_grid_adaBoost, 
                                    scoring=scorer_adaBoost, cv=5, n_jobs = -1)
grid_search_adaBoost.fit(x_train, y_train)

print(f"Best Parameters:{grid_search_adaBoost.best_params_} \nScore: {grid_search_adaBoost.best_score_}")
Best Parameters:{'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1), 'adaboostclassifier__learning_rate': 0.3, 'adaboostclassifier__n_estimators': 60} 
Score: 0.9645730416372619
Wall time: 1min 6s
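The `adaboostclassifier__` prefixes in the grid come from `make_pipeline`, which names each step after its lowercased class name; parameters are then addressed as `<stepname>__<param>`. A minimal check (using LogisticRegression just to keep the example light):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives step names from the lowercased class names
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(list(pipe.named_steps))  # -> ['standardscaler', 'logisticregression']
```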

RandomSearch

In [74]:
%%time
pipeline1 = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))

param_dist_adaBoost = {
    "adaboostclassifier__n_estimators": np.arange(10, 80, 10),
    "adaboostclassifier__learning_rate": [0.01,0.05,0.1,0.3,1],
    "adaboostclassifier__base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Use Accuracy since both recall and precision are equally important in reality.
scorer_adaBoost1 = metrics.make_scorer(metrics.accuracy_score)
random_search_adaBoost = RandomizedSearchCV(estimator=pipeline1, param_distributions=param_dist_adaBoost, n_iter=50, 
                                scoring=scorer_adaBoost1, cv=5, random_state=1)
random_search_adaBoost.fit(x_train,y_train)

print(f"Best parameters are {random_search_adaBoost.best_params_} \nscore={random_search_adaBoost.best_score_}:")
Best parameters are {'adaboostclassifier__n_estimators': 70, 'adaboostclassifier__learning_rate': 0.3, 'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} 
score=0.9642907551164432:
Wall time: 2min 19s

XGBoost Hyperparameter Tuning

GridSearch

In [ ]:
%%time 
pipeline2=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))

param_grid_XGBoost={
    'xgbclassifier__n_estimators':np.arange(50,250,50),'xgbclassifier__scale_pos_weight':[0,1,3,7],
    'xgbclassifier__learning_rate':[0.01,0.05,0.1,0.2], 'xgbclassifier__gamma':[0,2,5],
    'xgbclassifier__subsample':[0.7,0.85,1]
}

scorer_XGBoost = metrics.make_scorer(metrics.accuracy_score)
grid_search_XGBoost = GridSearchCV(estimator=pipeline2, param_grid=param_grid_XGBoost, 
                                   scoring=scorer_XGBoost, cv=5, n_jobs = -1)
grid_search_XGBoost.fit(x_train,y_train)

print(f"Best parameters are {grid_search_XGBoost.best_params_} \nscore={grid_search_XGBoost.best_score_}")

RandomSearch

In [ ]:
%%time 
pipeline3=make_pipeline(StandardScaler(),XGBClassifier(random_state=1,eval_metric='logloss', n_estimators = 50))

#Parameter grid to pass in RandomizedSearchCV
param_dist_XGBoost={
    'xgbclassifier__n_estimators':np.arange(50,250,50),
    'xgbclassifier__scale_pos_weight':[0,1,3,7],
    'xgbclassifier__learning_rate':[0.01,0.05,0.1,0.2],
    'xgbclassifier__gamma':[0,2,5],
    'xgbclassifier__subsample':[0.7,0.85,1],
    'xgbclassifier__max_depth':np.arange(1,8,1),
    'xgbclassifier__reg_lambda':[0,1,3,7]
}

scorer_XGBoost1 = metrics.make_scorer(metrics.accuracy_score)
random_search_XGBoost = RandomizedSearchCV(estimator=pipeline3, param_distributions=param_dist_XGBoost, n_iter=50, 
                                   scoring=scorer_XGBoost1, cv=5, random_state=1)
random_search_XGBoost.fit(x_train,y_train)

print(f"Best parameters are {random_search_XGBoost.best_params_} with CV score={random_search_XGBoost.best_score_}:")

Comparing the performance:

XGBoost has the best test accuracy. However, with large data the time cost matters: tuning XGBoost took about 24 minutes for under 1% more accuracy, while AdaBoost tuned in about 2 minutes with nearly the same accuracy. For a model that needs frequent retraining, the second-best model (AdaBoost) is recommended.

Conclusion & Insights:

Since the purpose of the project is to find the customers who are likely to stop using their credit card, it is also important to examine customer behavior. From the EDA, it became clear that:

  • A revolving balance below 500 is associated with more credit card cancellations.
  • Fewer than 60 transactions is associated with more cancellations.
  • A total transaction amount below 4,000 is associated with more cancellations.
  • Gender, income, marital status, and education level do not appear to affect cancellation.

Consider the following business recommendations:

  • Call customers whose revolving balance is very low or trending negative, and see how the bank can help.
  • Run more promotions to increase organic credit card usage.
  • Survey customers, and remind them that cancelling a credit card can temporarily hurt their credit score.
In [ ]: